Overview

This report provides an evaluation of the accuracy and precision of probabilistic nowcasts and forecasts of the weekly number of confirmed influenza hospital admissions submitted to the FluSight Hub. Some analyses include forecasts submitted over 14 weeks, starting on October 11, 2023. Others focus on evaluating “recent” forecasts, submitted only in the last 4 weeks, starting on December 20, 2023.

The US Centers for Disease Control and Prevention (CDC) collects short-term forecasts from dozens of research groups around the globe. Every week, CDC combines the most recent forecasts from each team into a single “ensemble” forecast for each of the targets. This ensemble is used as the official CDC forecast, typically appearing on their forecasting website on Friday.

This report evaluates forecasts at the state level for the weekly number of confirmed influenza hospital admissions at 0 to 3 week horizons, using methods similar to those employed for the COVID-19 Evaluation Reports. Data published by CDC on healthdata.gov (details here) are used as ground truth for evaluating the forecasts.

We evaluate models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 4 weeks and for the entire 2023-2024 season. To account for variation in the difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE, which also adjusts for teams submitting forecasts for different subsets of weeks, locations and horizons. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate lower accuracy than the baseline on average.
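The pairwise approach can be sketched as follows. This is an illustration, not the hub's actual scoring code; the function name and input shape are our own. Each pair of models is compared only on the forecast tasks both scored, which is what adjusts for teams submitting different subsets; a model's relative skill is the geometric mean of its pairwise mean-WIS ratios, rescaled so the baseline model scores exactly 1.

```python
from statistics import geometric_mean

def pairwise_relative_wis(scores, baseline="baseline"):
    """Adjusted relative WIS via pairwise comparisons (a sketch).

    `scores` maps model name -> {task: wis}, where a task identifies one
    (location, horizon, week) combination. Mean WIS requires positive
    scores here; zeros would break the geometric mean.
    """
    models = list(scores)
    theta = {}
    for m in models:
        ratios = []
        for m2 in models:
            # compare only on the tasks both models actually scored
            common = scores[m].keys() & scores[m2].keys()
            if not common:
                continue
            mean_m = sum(scores[m][t] for t in common) / len(common)
            mean_m2 = sum(scores[m2][t] for t in common) / len(common)
            ratios.append(mean_m / mean_m2)
        theta[m] = geometric_mean(ratios)
    # rescale so the baseline model has adjusted relative WIS 1
    return {m: theta[m] / theta[baseline] for m in models}
```

A model scoring half the baseline's WIS on every shared task comes out at 0.5, matching the "lower than 1 means more accurate than baseline" reading above.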

We generated scores in two ways: with the raw counts and with the log-transformed counts. It has been argued that log-transformation prior to scoring yields epidemiologically meaningful and easily interpretable results, while also reducing the impact of high-count locations on aggregated scores (Bosse et al. 2023).
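Concretely, scoring on the log scale means transforming both the forecast quantiles and the observation before computing WIS or MAE. A minimal sketch, with names of our choosing; the offset of 1 (i.e. scoring log(x + 1)) keeps zero counts defined, and because the log is monotone, each quantile can be transformed directly:

```python
import math

def to_log_scale(quantile_preds, observed, offset=1.0):
    """Move a quantile forecast and its observation to the log scale
    before scoring (a sketch of the approach in Bosse et al. 2023).

    `quantile_preds` maps quantile level -> predicted count. Monotonicity
    of log means the q-th quantile of log(X + offset) is the log of the
    q-th quantile of X, plus the offset.
    """
    log_preds = {q: math.log(v + offset) for q, v in quantile_preds.items()}
    log_obs = math.log(observed + offset)
    return log_preds, log_obs
```

Downstream scoring code is unchanged; it simply receives the transformed values.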

New Hospital Admission Forecasts

Raw counts

These evaluations are based on raw counts.

Summary Tables

These tables evaluate recent accuracy for forecasts submitted in the four most recent weeks, and historical accuracy for all forecasts submitted in the current season. The first two tables evaluate forecasts based on their WIS and MAE, overall and by horizon. The last two tables evaluate prediction interval coverage rates, overall and by horizon.

Inclusion criteria for each column are detailed below the table.

Recent accuracy

To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 4 weeks, since December 23, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
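The 50% inclusion rule can be sketched as below; the function name and input shape are illustrative, not the hub's implementation. A model qualifies if it submitted at least half of all (location, target, forecast date) combinations observed in the evaluation window:

```python
def eligible_models(forecasts, threshold=0.5):
    """Apply the 50% inclusion rule (a sketch; the input shape is assumed).

    `forecasts` is a list of (model, location, target, forecast_date)
    tuples, one per submitted forecast.
    """
    # every (location, target, forecast_date) combination seen in the window
    combos = {(loc, tgt, date) for _, loc, tgt, date in forecasts}
    submitted = {}
    for model, loc, tgt, date in set(forecasts):
        submitted.setdefault(model, set()).add((loc, tgt, date))
    # keep models covering at least `threshold` of the possible combinations
    return sorted(m for m, s in submitted.items()
                  if len(s) >= threshold * len(combos))
```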

Historical accuracy

To calculate each column in the table, different inclusion criteria were applied. This table includes forecasts for the last 14 weeks, since October 14, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.

Recent coverage

This table only includes forecasts for the last 4 weeks, since December 23, 2023. For inclusion in this table, the models must have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their 95% PI coverage, with the models whose empirical coverage rates are closest to 95% at the top.

Historical coverage

This table only includes forecasts for the last 14 weeks, since October 14, 2023. For inclusion in this table, the models must have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their 95% PI coverage aggregated across horizons, with the models whose empirical coverage rates are closest to 95% at the top.

WIS components

The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 4 weeks, from models that submitted at least 50% of forecasts during this time. These are the same inclusion criteria applied for WIS scores in the recent evaluation period.

The bar segments for each model sum to its average WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. Models are ordered on the x axis by their relative WIS score from the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the bar totals across models, rounded up to the nearest 10.
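The decomposition behind these bars can be sketched from the interval-score form of the WIS; the function and its input shape are illustrative names of ours. Dispersion is the weighted width of the central prediction intervals; overprediction and underprediction collect the penalties incurred when the observation falls below or above an interval, with the median's absolute-error term classified by its sign. The three components sum to the WIS:

```python
def wis_components(median, intervals, y):
    """Decompose the WIS into dispersion, over- and underprediction
    (a sketch). `intervals` lists (alpha, lower, upper) for each central
    (1 - alpha) prediction interval; `y` is the observed count.
    """
    disp = over = under = 0.0
    # the median contributes |y - median| with weight 1/2
    if median > y:
        over += 0.5 * (median - y)
    else:
        under += 0.5 * (y - median)
    for alpha, lo, hi in intervals:
        disp += (alpha / 2.0) * (hi - lo)
        # the (alpha/2)-weighted (2/alpha) penalty simplifies to the distance
        if y < lo:
            over += lo - y
        elif y > hi:
            under += y - hi
    norm = len(intervals) + 0.5
    return {"dispersion": disp / norm,
            "overprediction": over / norm,
            "underprediction": under / norm}
```

Summing the returned components recovers the model's WIS for that forecast, which is what makes the stacked-bar display possible.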

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are from models that submitted probabilistic forecasts for all 50 states. In the legend, the models with a dot and line have scores for every week, while the models with just a line are missing scores for at least one week.

For the figures, WIS is used as a metric, with the y axis truncated at the 97.5th percentile of the weekly average WIS. The first figure shows the mean WIS across all 50 states for submission weeks beginning October 14, 2023 at a 0 week horizon. The next 3 figures show the mean WIS aggregated across locations for 1, 2 and 3 week horizons. The last 4 figures show the empirical 95% PI coverage aggregated across locations for all horizons.
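Empirical PI coverage itself is straightforward to compute; a minimal sketch (names are ours):

```python
def empirical_coverage(lowers, uppers, observed):
    """Share of observations falling inside the corresponding prediction
    interval (a sketch). A well-calibrated model's 95% intervals should
    cover roughly 95% of observations.
    """
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lowers, uppers, observed))
    return hits / len(observed)
```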

0 Week Horizon WIS

In this figure, the models with dashed lines are not included in the FluSight ensemble.

1 Week Horizon WIS

In this figure, the models with dashed lines are not included in the FluSight ensemble.

2 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.

3 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.

0 Week Horizon 95% PI Coverage

We would expect a well-calibrated model to have a value of 95% in this plot. In this figure, the models with dashed lines are not included in the FluSight ensemble.

1 Week Horizon 95% PI Coverage

We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.

2 Week Horizon 95% PI Coverage

We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.

3 Week Horizon 95% PI Coverage

We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the larger horizons compared to the 0 week horizon. In this figure, the models with dashed lines are not included in the FluSight ensemble.

Evaluation by location

The figures below show recent model performance stratified by location. We only included forecasts for the last 4 weeks. Models were included if they submitted forecasts for all 4 horizons and at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.

The color scheme shows the WIS relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected jurisdictions and the national level. Models are ordered on the x axis by their relative WIS score from the accuracy table, aggregated across horizons.

Evaluation Periods

This figure shows the weekly number of confirmed influenza hospital admissions reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period. The vertical green line indicates the beginning of the “seasonal” model evaluation period.

Log-transformed counts

These evaluations are based on log-transformed counts, as recommended by Bosse et al. (2023).

Summary Tables

These tables evaluate recent accuracy for forecasts submitted in the four most recent weeks, and historical accuracy for all forecasts submitted in the current season, based on log-transformed counts. The tables evaluate forecasts based on their WIS and MAE, overall and by horizon.

Inclusion criteria for each column are detailed below the table.

Recent accuracy

To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 4 weeks, since December 23, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.

Historical accuracy

To calculate each column in the table, different inclusion criteria were applied. This table includes forecasts for the last 14 weeks, since October 14, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.

WIS components

The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 4 weeks, from models that submitted at least 50% of forecasts during this time. These are the same inclusion criteria applied for WIS scores in the recent evaluation period.

The bar segments for each model sum to its average WIS. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. Models are ordered on the x axis by their relative WIS score from the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the bar totals across models, rounded up to the nearest 10.

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are from models that submitted probabilistic forecasts for all 50 states. In the legend, the models with a dot and line have scores for every week, while the models with just a line are missing scores for at least one week.

For the figures, WIS is used as a metric, with the y axis truncated at the 97.5th percentile of the weekly average WIS. The first figure shows the mean WIS across all 50 states for submission weeks beginning October 14, 2023 at a 0 week horizon. The next 3 figures show the mean WIS aggregated across locations for 1, 2 and 3 week horizons.

0 Week Horizon WIS

In this figure, the models with dashed lines are not included in the FluSight ensemble.

1 Week Horizon WIS

In this figure, the models with dashed lines are not included in the FluSight ensemble.

2 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.

3 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error across all models, as a “point of reference”. This shows that the scale of errors increases with larger horizons. The models with dashed lines are not included in the FluSight ensemble.

Evaluation by location

The figures below show recent model performance stratified by location. We only included forecasts for the last 4 weeks. Models were included if they submitted forecasts for all 4 horizons and at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.

The color scheme shows the WIS relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected jurisdictions and the national level. Models are ordered on the x axis by their relative WIS score from the accuracy table, aggregated across horizons.

Evaluation Periods

This figure shows the weekly number of confirmed influenza hospital admissions reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period. The vertical green line indicates the beginning of the “seasonal” model evaluation period.